Omics data

With a focus on high-throughput sequencing data

Jelmer Poelstra

CFAES Bioinformatics Core, OSU

2025-08-26

An overview of omics data

The main omics data types

Copyright ThermoFisher

The main omics data types

  • Genomics (including metagenomics) DNA
  • Epigenomics DNA modifications
  • Transcriptomics RNA
  • Proteomics Proteins
  • Metabolomics Metabolites

Note

To count as omics, the data should be large-scale: e.g., “genomics” generally refers to work at the “whole-genome” level.

Both genomics and transcriptomics data, in the broad definitions above, are produced by high-throughput sequencing technologies.

That will be the focus of this lecture and will be used in examples throughout the course.

Intro to sequencing technologies

What does sequencing refer to?

The shorthand “sequencing”, as in “high-throughput sequencing”, generally refers to determining the nucleotide sequence of fragments of DNA.


What about RNA or proteins?

  • RNA is usually reverse transcribed to DNA (cDNA) prior to sequencing, as in nearly all “RNA-seq”.

    Direct RNA sequencing is possible with one of the sequencing technologies we’ll discuss, but this is under development and not yet widely used.


  • Protein sequencing requires different technology altogether, such as mass spectrometry, and is not further discussed in this lecture.

Sequencing technologies: overview

Sanger sequencing (since 1977)
Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time


High-throughput sequencing (HTS, since 2005)
Sequences hundreds of thousands to billions of DNA fragments, usually randomly selected, at a time


Sequenced DNA fragments are referred to as “reads”.

Sequencing cost through time

https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

High-throughput sequencing (HTS)

Examples of HTS applications

  • Whole-genome assembly
  • Variant analysis (for population genetics/genomics, molecular evolution, GWAS, etc.):

    • Whole-genome “resequencing”

    • Reduced-representation libraries (e.g. RADseq, GBS)

  • RNA-seq (transcriptome analysis)
  • Other functional sequencing methods like methylation sequencing, ChIP-seq, etc.
  • Microbial community characterization

    • Metabarcoding

    • Shotgun metagenomics

HTS applications (cont.)

Examples of HTS analyses

  • Algorithmic/bioinformatics stage:
    • Read QC
    • Read trimming
    • Read alignment and classification
    • Read assembly
    • Genotype calling

  • Biostatistical stage:
    • Differential abundance among groups
    • Clustering/ordination and network analyses
    • GWAS
    • Functional enrichment

The main HTS technologies

Short-read HTS

  • Produces up to billions of 50-300 bp highly accurate reads

  • Market dominated by Illumina

  • Since 2005 — technology fairly stable

  • (AKA Next-Generation Sequencing - NGS)

Long-read HTS

  • Reads much longer than in NGS but fewer, less accurate, and more costly per base
  • Mainly Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio)
  • Since 2011 — remains under rapid development


Short videos explaining the technology (90 s to 5 min each)

Read lengths

  • Short-read (Illumina) HTS: 50-300 bp reads
  • Long-read HTS: longer & more variable read lengths (PacBio: 10-50 kbp, ONT: 10-100+ kbp)

When are longer reads useful?
  • Genome assembly

  • Haplotype and large structural variant calling

  • Transcript isoform identification

  • Taxonomic identification of single reads (microbial metabarcoding)


When does read length not matter (as much)?
  • SNP variant analysis

  • Read-as-a-tag: the goal is just to know a read’s origin in a reference genome, like in counting applications such as RNA-seq

Error rates

Currently, no sequencing technology is error-free.

  • Illumina error rates are mostly below 0.1%
  • TBA

Error rates are changing

Error rates in one recent type of PacBio sequencing where individual fragments are sequenced multiple times (“HiFi”) are now lower than in Illumina.

Error rates of ONT sequencing are also continuously decreasing.


Quality scores in sequence data

When you get sequences from a high-throughput sequencer, base calls have typically already been made. Every base is also accompanied by a quality score (inversely related to the estimated error probability). We’ll talk about those in some more detail in a bit.

Illumina libraries and sequencing

Libraries and library prep

We will talk a bit about Illumina library prep because this is the most common type of sequencing, and because we will use Illumina read files as examples throughout the course.

In a sequencing context, a “library” is a collection of nucleic acid fragments ready for sequencing.


In Illumina and other HTS libraries, these fragments number in the millions or billions and are often randomly generated from input such as genomic DNA:

An overview of the library prep procedure. This is typically done for you by a sequencing facility or company.

A closer look at the processed DNA fragments

As shown in the previous slide, after library prep, each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:



Multiplexing!

Using the indices/barcodes in adapters, up to 96 samples can be multiplexed into a single library.

Paired-end vs. single-end sequencing

DNA fragments can be sequenced from both ends as shown below —
this is called “paired-end” (PE) sequencing:


When sequencing is instead single-end (SE), no reverse read is produced:

Insert size variation

The size of the DNA fragment can vary – both by design and because of limited precision in size selection. In some cases, it is:

  • Shorter than the combined read length, leading to overlapping reads (this can be useful):

  • Shorter than the single read length, leading to “adapter read-through” (i.e., the ends of the resulting reads will consist of adapter sequence, which should be removed):
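The two cases above can be sketched as a small classification for paired-end reads. This is only an illustration: the function name, the 150 bp read length, and the fragment lengths are hypothetical, and it assumes that forward and reverse reads have the same length.

```python
def classify_fragment(fragment_length: int, read_length: int) -> str:
    """Classify a paired-end fragment by how its length compares to the read length.

    Assumes forward and reverse reads both have length `read_length`.
    """
    if fragment_length < read_length:
        # Each read runs past the end of the fragment into the adapter
        return "adapter read-through"
    if fragment_length < 2 * read_length:
        # The forward and reverse reads cover some of the same bases
        return "overlapping reads"
    return "non-overlapping reads"


def adapter_bases_per_read(fragment_length: int, read_length: int) -> int:
    """Number of adapter bases expected at the end of each read (0 if none)."""
    return max(0, read_length - fragment_length)


# Hypothetical fragment lengths, sequenced with 150 bp reads
for fragment_length in (100, 250, 500):
    label = classify_fragment(fragment_length, read_length=150)
    n_adapter = adapter_bases_per_read(fragment_length, read_length=150)
    print(fragment_length, label, n_adapter)
```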

The sequencing process

https://www.youtube.com/watch?v=fCd6B5HRaZ8

Reference genomes

Genomes

Many HTS applications either require a “reference genome” or involve its production.


What exactly does “reference genome” refer to? It usually includes:

  • Assembly
    A representation of most of the genome DNA sequence: the genome assembly

  • Annotation
    The “annotation” that provides the locations of genes and other genomic features, as well as functional information on these features

Taxonomic identity

Reference genomes are typically needed and used at the species level.

  • If needed, it is often possible to work with reference genomes of closely related species
  • Conversely, multiple reference genomes may exist, e.g. for different subspecies

Sequence file types

Overview

All common sequence/genomic data files are plain-text. The main types are:

  • FASTA
    Simple sequence files, where each entry contains a header and a DNA/AA sequence.
    Versatile, can contain a few short sequences, entire genome assemblies, proteomes, and alignments.

  • FASTQ
    The standard format for HTS reads — contains a quality score for each nucleotide.

  • SAM/BAM
    An alignment format for HTS reads.


  • GTF/GFF
    Tab-delimited tables with information, such as genomic coordinates, on “genomic features” like genes and exons. These files contain reference genome annotations.

FASTA files

FASTA files contain one or more DNA or amino acid sequences, with no limits on the number of sequences or the sequence lengths.

The following example FASTA file contains two entries:

>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG

Each entry contains a header and the sequence itself, where:

  • Header lines start with a > and provide identifying information for the sequence
  • The sequence is often spread across multiple lines with a fixed width
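To make the header/sequence structure concrete, here is a minimal Python sketch that reads the two-entry example above into a dictionary. The function name `parse_fasta` is made up for this illustration, and the sketch assumes well-formed input in which every header contains an ID; for real work, an established library is a safer choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_ID: sequence} from the lines of a FASTA file."""
    records: Dict[str, str] = {}
    current_id = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the text between ">" and the first whitespace
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped across lines, so join them
            records[current_id] += line
    return records


example = """\
>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
"""

records = parse_fasta(example.splitlines())
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```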

FASTQ

FASTQ is the standard format for HTS reads.
Each read forms one FASTQ entry and is represented by four lines, which contain, respectively:

  1. A header that starts with @ and typically uniquely identifies the read
  2. The sequence itself
  3. A + (plus sign)
  4. One-character quality scores for each base (hence FASTQ as in “Q” for “quality”)
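The four-line structure can be sketched in Python as follows. The read shown is invented for illustration (it does not come from a real sequencing run), the names `FastqRecord` and `parse_fastq` are hypothetical, and the sketch assumes that each sequence and quality string sits on a single line, as is usual for HTS reads.

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # line 1, without the leading "@"
    sequence: str  # line 2
    quality: str   # line 4, one character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of FASTQ input."""
    line_iter = iter(lines)
    while True:
        chunk = [line.rstrip("\r\n") for line in islice(line_iter, 4)]
        if not chunk:
            return  # end of input
        if len(chunk) != 4:
            raise ValueError("Incomplete FASTQ record (expected 4 lines)")
        header, sequence, plus, quality = chunk
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# A made-up read, for illustration only
example = """\
@read1 hypothetical_read
GATTACAGATTACA
+
IIIIIIIIII??5+
"""

for record in parse_fastq(example.splitlines()):
    print(record.header, record.sequence, record.quality)
```

Real FASTQ files are usually gzip-compressed; opening them with `gzip.open(path, "rt")` gives lines that can be passed to the same function.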

FASTQ quality scores

The quality scores we saw in the read on the previous slide represent an estimate of the error probability of the base call.

Specifically, they correspond to a numeric “Phred” quality score (Q), which is a function of the estimated probability that a base call is erroneous (P):

Q = -10 * log10(P)


For some specific probabilities and their rough qualitative interpretation for Illumina data:

Phred quality score    Error probability    Rough interpretation
10                     1 in 10              terrible
20                     1 in 100             bad
30                     1 in 1,000           good
40                     1 in 10,000          excellent
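As a quick check of the formula, the conversion in both directions can be written in a few lines of Python. The function names are made up for this sketch; the second function is simply the formula above solved for P.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P = {p:g}, i.e. about 1 in {round(1 / p):,}")
```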

FASTQ quality scores (cont.)

This numeric quality score is represented in FASTQ files not by the number itself, but by a corresponding “ASCII character”.

This allows for a single-character representation of each possible score — as a consequence, each quality score character can conveniently correspond to (& line up with) a base character in the read.

Phred quality score    Error probability    ASCII character
10                     1 in 10              +
20                     1 in 100             5
30                     1 in 1,000           ?
40                     1 in 10,000          I
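The characters in the table follow from adding an offset of 33 to the Phred score and taking the ASCII character with that code (the “Phred+33” encoding, which is assumed here; a few old file formats used a different offset). A minimal Python sketch, with a hypothetical function name:

```python
from typing import List

PHRED_OFFSET = 33  # assumption: the common "Phred+33" encoding


def decode_quality(quality_string: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string to a list of numeric Phred scores."""
    return [ord(char) - offset for char in quality_string]


# The four characters from the table above
for char, score in zip("+5?I", decode_quality("+5?I")):
    print(repr(char), score)
```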

A rule of thumb

In practice, you almost never have to manually check the quality scores of bases in FASTQ files, but if you do, a rule of thumb is that letter characters are good (Phred of 32 and up).

FASTQ (cont.)

FASTQ files have no size limit. In paired-end (PE) sequencing, forward and reverse reads are typically split into two files:
forward-read files contain R1 and reverse-read files contain R2 in the file name.

For example, having paired-end FASTQ files for 2 samples could look like this:

# A listing of (unusually simple) file names:
sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
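Because most tools need the R1 and R2 file of a sample together, a common first step is to group file names by sample. The sketch below assumes the simple naming pattern shown above (`<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`); real file names are often longer, so the pattern would need adjusting. The function name is hypothetical.

```python
import re
from typing import Dict, Iterable

# Assumed naming pattern: <sample>_R1.fastq.gz or <sample>_R2.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_(?P<read>R[12])\.fastq\.gz$")


def pair_fastq_files(file_names: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group file names as {sample: {"R1": file, "R2": file}}."""
    pairs: Dict[str, Dict[str, str]] = {}
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match is None:
            raise ValueError(f"Unexpected file name: {name}")
        pairs.setdefault(match["sample"], {})[match["read"]] = name
    return pairs


file_names = [
    "sample1_R1.fastq.gz",
    "sample1_R2.fastq.gz",
    "sample2_R1.fastq.gz",
    "sample2_R2.fastq.gz",
]

for sample, files in pair_fastq_files(file_names).items():
    print(sample, files["R1"], files["R2"])
```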

GFF/GTF

TBA

Questions?






Bonus slides

Sequencing technology development timeline

Modified after Pereira et al. 2020

Overcoming sequencing errors

Sequencing every base multiple times, i.e. having a so-called “depth of coverage” of more than 1x, makes it possible to infer the correct sequence:


  • Overcoming sequencing errors is made more challenging by natural genetic variation among and within individuals.
  • Typical depths of coverage: at least 50-100x for genome assembly; 10-30x for resequencing.
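Average depth of coverage can be estimated as the total number of sequenced bases divided by the genome size. The sketch below applies that arithmetic; the genome size (500 Mbp) and read length (150 bp) are hypothetical values, and the calculation assumes that all reads come from the target genome and are spread evenly across it, which real data only approximates.

```python
import math


def depth_of_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads needed to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical example: a 500 Mbp genome sequenced with 150 bp reads
genome_size = 500_000_000
read_length = 150

for target in (10, 30, 100):
    n_reads = reads_needed(target, read_length, genome_size)
    print(f"{target}x: {n_reads:,} reads")
```

For paired-end data, each read pair counts as two reads in this calculation.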

Genome size variation

https://en.wikipedia.org/wiki/Genome_size


Growth of genome databases

Konkel & Slot 2023